ABSTRACT

Motivation: Increased availability of various genotyping techniques 
has initiated a race for finding genetic markers that can be used in 
diagnostics and personalized medicine. Although many genetic risk 
factors are known, key causes of common diseases with complex 
heritage patterns are still unknown. Identification of such complex 
traits requires a targeted study over a large collection of data. 
Ideally, such studies bring together data from many biobanks. 
However, data aggregation on such a large scale raises many privacy 
issues. 

Results: We show how to conduct such studies without violating priv- 
acy of individual donors and without leaking the data to third parties. 
The presented solution has provable security guarantees. 
Contact: jaak.vilo@ut.ee 

Supplementary information: Supplementary data are available at 
Bioinformatics online. 

Received on October 12, 2012; revised on January 27, 2013; accepted 
on February 5, 2013 

1 INTRODUCTION 

Genome-wide association studies (GWAS) are one of the driving
reasons behind the formation of nationwide and privately funded
gene banks. Many chronic diseases and various cancer types are
known to have genetic predisposition factors (Chakravarti and
Little, 2003; Lander and Schork, 1994). Although many underlying
genetic signatures have been successfully identified for
Mendelian disorders (Hamosh et al., 2005), few genetic
risk factors for complex diseases have been discovered and confirmed.
GWAS have identified some risk factors for type II diabetes
(Prokopenko et al., 2008) and for a few other common
diseases (Manolio et al., 2008; Wellcome Trust Case Control
Consortium, 2007). GWAS have been modestly successful in
pharmacogenetics (Grant and Hakonarson, 2007) and cancer
research (Varghese and Easton, 2010). The size and structure
of a study cohort are the main limiting factors in such studies,
as the individual impact of genomic differences is usually small.
Larger sample sizes increase the sensitivity of statistical tests and
make it possible to apply a wide range of data-mining methods
(Moore et al., 2010; Szymczak et al., 2009).

Ideally, studies should use nationwide and continent-wide pa- 
tient cohorts. Formation of such cohorts is becoming feasible as 
genotyping costs are rapidly decreasing (Hayden, 2010;
Pettersson et al., 2009). In addition to nationwide biobanks,
e.g. the UK Biobank, several personal genomics companies, 
such as 23andMe and Navigenics, already possess large and di- 
verse patient cohorts. Biobanks are also forming large collabor- 
ation networks, such as P3G and HuGENet, to combine their 
patient cohorts and improve study quality. Privacy of individual 
gene donors is one of the biggest concerns in such projects. In 
many countries, genotype data are classified as sensitive data that 
can be handled by complying with specific restrictions, e.g. 
HIPAA in the USA and the Data Protection Directive in the 
European Union. These restrictions are justified, as a leak of 
genetic information can cause genome-based discrimination 
when more health-related patterns have been discovered. 

Standard anonymization methods are not applicable to genotype
data, as the data themselves are an ultimate identity code.
Only 30-80 out of 30 million single-nucleotide polymorphisms
(SNPs) are needed to uniquely identify a person (Lin et al., 2004).
Moreover, the size of online genotype databases for genealogy
studies, such as SGMF and YHRD, has made re-identification
of anonymized genotype data a real threat (Gymrek et al., 2013).
Re-identification attacks (Malin and Sweeney, 2000, 2002) based
on combining inferred phenotypes with public data become practical
as the list of known associations between genotype and
phenotypic traits (Hindorff et al., 2009) grows. Finally,
Homer et al. (2008) showed that even aggregated pools of genomic
data can leak private information. Although follow-up studies
(Visscher and Hill, 2009) softened the initial claims, the threat
remains.

These findings sparked a debate on whether one can promise privacy
of genotype data in consent forms at all (P3G Consortium
et al., 2009). In the following, we show how to set up an infrastructure
where the genotype data can be stored and processed so
that none of the peers involved in the process can reconstruct the
data, and thus the risk of accidental leaks and malicious data
abuse is greatly reduced. The data analysis algorithms are executed
in an oblivious manner so that only the desired outcome
is revealed to the user and nothing else. Unlike
well-known data perturbation and masking techniques
(Machanavajjhala et al., 2007; Sweeney, 2002), the security guarantees
are cryptographic. These guarantees depend on the computational
complexity of well-established mathematical problems
and not on the background knowledge of potential attackers.
As such, the presented methodology is applicable to protecting
biobanks and other medical databases.

2 APPLICATION SCENARIOS

2.1 Privacy threats in medical studies 

There are four principal groups of stakeholders in a typical med- 
ical study: data donors, data collectors, data analysts and super- 
visory organizations. Data donors consent to give their tissue 
samples and record other types of medical data, e.g. various 
questionnaires covering health issues. Data collectors are respon- 
sible for gathering and storing the data and keeping the donors' 
confidentiality. The baseline requirements are enforced by law,
e.g. by the Genetic Information Nondiscrimination Act of 2008 
in the USA. However, more stringent privacy guarantees can 
reduce data donors' fears about data abuse and increase the 
participation rate. 

Data collectors and supervisory organizations must guarantee 
that the data analysts (researchers) meet privacy restrictions. The 
confidentiality problem is somewhat smaller if the data analysts 
are from the same institution as the data collectors. If the ana- 
lysts are not part of the organization, usually confidentiality 
agreements are signed between the parties. In most cases, col- 
lected data are stored in several databases so that direct personal 
information is not accessible to researchers. Instead, each patient 
gets a (pseudo) random barcode that links different databases 
together (De Moor et al., 2003). On rare occasions, databases are
merged either to identify specific persons or to form datasets 
needed for medical research. 

All such solutions provide only partial security guarantees for 
data donors. Even if genetic data are stored separately and are 
not directly accessible, they must be (partially) released before 
the actual data analysis is carried out. Hence, a single fault by a 
data analyst or an insider attack might effectively obliterate all 
privacy guarantees. Therefore, compromises in confidentiality 
are needed for the creation of worldwide data banks for the 
scientific community. 

These privacy issues are often alleviated with data perturb- 
ation and agglomeration techniques. For instance, the National 
Institutes of Health in the USA published the ratio of SNP alleles 
of various case-control studies (Couzin, 2008) because it is es- 
sentially impossible to split a DNA mixture back to individual 
genotypes. However, it turned out that it is possible to detect 
whether a specific person is in the mixture or not (Homer et al.,
2008). As case-control groups are often based on sensitive infor- 
mation, the data had to be removed. 

Such unexpected security breaches are common for ad hoc
perturbation or agglomeration methods because one often overlooks
the effect of potential background knowledge on security.
Although for certain problems perturbation and agglomeration 
methods can provide provable security, their overall applicability 
is limited and there are many known impossibility results 
(Dwork, 2011). 

2.2 A novel solution based on distributed storage 

In this work, we propose a data collection system where sensitive
data are secret shared among several independent entities. In
brief, secret sharing assures that each party gets completely
random-looking data. However, when all parties pool the data
together, sensitive values can be restored (see Fig. 1 for an example
of how secret sharing works). Depending on the exact
nature of the secret sharing scheme used, the privacy of shared
values is preserved even if some of the parties holding shares
collude (discussed further in the methods section). Hence, data
can be shared without the fear of unexpected disclosure.
Moreover, with the use of specific multi-party computation techniques,
computations on secret shared data can be carried out
without leaking any information (Bogdanov et al., 2008;
Damgård et al., 2009). As a result, one can deploy a distributed
computation environment that securely collects the data, does
oblivious computations and returns the desired end results.
Such systems have been successfully used for securing auctions
(Bogetoft et al., 2009) and analyzing financial data (Bogdanov
et al., 2012).

Fig. 1. This figure illustrates how players A and B use a 3-out-of-3 additive
secret sharing scheme to distribute two 32-bit integer values x and y
into shares. The shares are sent to three servers that use the homomorphic
property of the scheme to securely compute the sum of x and y. The
shares of the sum are sent to player C, who reconstructs the result
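
The additive scheme illustrated in Fig. 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Sharemind implementation: the function names `share`, `reconstruct` and `add_shared` are hypothetical, and the three "servers" are simply positions in a list.

```python
import secrets

MOD = 2 ** 32  # shares are 32-bit integers, as in Fig. 1


def share(value, n_parties=3):
    """Split value into n additive shares that sum to value modulo 2**32."""
    shares = [secrets.randbelow(MOD) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % MOD  # the final share fixes the sum
    return shares + [last]


def reconstruct(shares):
    """Recover the secret; all shares are required (3-out-of-3)."""
    return sum(shares) % MOD


def add_shared(x_shares, y_shares):
    """Each server adds the two shares it holds; no communication is needed."""
    return [(xs + ys) % MOD for xs, ys in zip(x_shares, y_shares)]


# Players A and B share their inputs; player C reconstructs only the sum.
x_shares = share(1234)
y_shares = share(5678)
sum_shares = add_shared(x_shares, y_shares)
result = reconstruct(sum_shares)
```

Any single share, or any pair of shares, is uniformly distributed and therefore reveals nothing about the shared value on its own.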

There are other cryptographically secure computation tech- 
niques, e.g. (fully) homomorphic encryption. However, these 
techniques are significantly slower and less feasible on the large 
genome databases. 

Figure 2 depicts the overall workflow of secure GWAS. The 
core of such a system consists of three or more dedicated data 
centers (hosts) that are assumed to be independent organizations. 
For a worldwide study, these can be biobanks of different coun- 
tries, regulatory authorities and patient interest groups. None of 
them have to be unconditionally trusted as long as too many of 
them do not collude with others. In particular, a successful attack 
against one or two of the data centers leaks no information and 
there even exists a recovery procedure from such attacks. 

Genomic data are entered into the system by primary data
collectors, e.g. wetlabs or different biobanks who collect and
process the biological samples and who want to make joint analyses
on shared data. For this operation, standard clinical protocols
for genotyping are sufficient for security. At the end of the
stage, the data are secret shared and transferred securely to the
hosts.

Fig. 2. Secure GWAS consists of three major stages: data acquisition, formation of case-control groups and statistical testing. Panel (A) depicts how
these three stages are linked. Data are gathered and sent in securely coded shares to be stored. For statistical tests, case and control info is securely coded
and applied to the securely stored data so that statistical analyses can be carried out. Panel (B) describes two alternative scenarios that can be used for
secure storage of genotype and phenotype data. Scenario 1 depicts a situation where genotype data are entered into secure storage by the wetlab and
phenotype data are entered by the donors themselves. Scenario 2 depicts a case where different gene banks send selected genotype and phenotype data to
secure storage so they can make joint analyses on more data. Panel (C) describes how case and control groups can be formed. In the simplest case,
researchers have unrestricted access to phenotype data and can thus form case and control groups by themselves. In more complex settings, researchers do
not have rights to access phenotype data, and hosts must use secure multi-party computations to construct case and control groups based on inclusion
criteria

Clinical data can be entered into the system either by the data 
donors themselves or by primary data collectors. The data are 
sent to the secure phenotype database in secret shared form to 
ensure confidentiality. In the analysis stage, the data analyst 
specifies the algorithm to be run on the secret shared data and 
waits for the results. For GWAS, the analyst first forms case and 
control groups. Next, the algorithm computes the necessary stat- 
istics and, finally, it releases the loci that show a statistically 
significant differentiation according to the specified case-control 
groups. If the analyst wants to further examine a secured input 
value, all parties hosting the system must agree to disclose the 
respective shares. 

2.3 Potential advantages and drawbacks 

We acknowledge that standard security measures are sufficient 
when the data are collected and analyzed by a single organiza- 
tion. Still, long-term projects can benefit from distributed data 
storage. First, a break-in into a data center yields no usable information.
Second, splitting the data among independent organizations
gives additional guarantees for the data donors, e.g.
participants of commercial studies have no way of knowing
what happens to their data if the project goes bankrupt. If one 
core center belongs to the state, participants know with greater 
certainty that their data cannot be abused. Third, other sources 
of private data can be incorporated into the analysis without 
privacy leaks. In particular, medical institutions can use their 
patient records to enhance analysis. The proposed solution pro- 
vides a way to conduct the analysis so that neither the gene bank 
nor the medical institution releases their data. 

The benefits of our approach are evident in collaborative studies
between independent biobanks. As nothing beyond the
desired test results is revealed during the computation, the solution
provides superior privacy guarantees compared with alternatives
based on meta-analysis techniques (Wolfson et al., 2010),
where each biobank first computes local summaries that are
collaboratively merged into a final result. In such alternatives, only a few
summary values are disclosed. However, it is impossible to tell
what exactly can be inferred about concealed values. Moreover,
leakages of individual studies can accumulate as in DNA pools,
where aggregation of minuscule effects on SNP frequencies
allows us to make strong conclusions.

The biggest technical drawback of our solution is computational
efficiency. As the data are secret shared between core
centers and each oblivious computation step requires network
communication, secure algorithms are several orders of magnitude
slower than their insecure counterparts. Hence, we set up a controlled
experiment to show that GWAS are feasible in our
setting.

Finally, note that legal restrictions can form a major obstacle, 
as secret sharing is not as widely used as data encryption. 
However, legal issues are out of our scope as we analyze only 
the technological feasibility and potential advantages. 



Table 1. Contingency table for the standard χ² test

Group      Allele A   Allele B
Cases      a          c
Controls   b          d



Table 2. Contingency table for the Cochran-Armitage test for trend 



3 METHODS 

3.1 Single-point analysis 

The first step in GWAS is splitting individuals into case and control
groups. These groups are formed based on phenotypic traits, such as
presence and severity of disease. Although all four nucleotides can be
located in a SNP site, it is common to consider two alleles, where the
first one corresponds to the reference sequence and the other represents
potential mutations. Table 1 depicts the 2 × 2 contingency table for allele
counts in case and control groups, where an allele is counted twice if it is
present in both DNA strands (is homozygous). The test statistic for the
standard χ² test is expressed as

\[ T_1 = \frac{(a + b + c + d)(bc - ad)^2}{(a + b)(a + c)(b + d)(c + d)} \tag{1} \]

and the test statistic for equiproportionality of the allele A in both groups is

\[ T_2 = \frac{(b + d)(bc - ad)^2}{b\,(a + c)^2\,d} \tag{2} \]

For reasonable sample sizes, both test statistics are distributed according
to the χ² distribution with one degree of freedom. See the work of Visscher
and Hellard (2003) for further details.

These tests are accurate if the Hardy-Weinberg equilibrium condition
is satisfied for a particular SNP, whereas the Cochran-Armitage test for
trend can be used without this assumption. First, one must assemble the
2 × 3 contingency table depicted in Table 2 and then compute the test
statistic as

\[ T_3 = \frac{N \left[ \sum_{i=0}^{2} \alpha_i (r_i m_2 - s_i m_1) \right]^2}{m_1 m_2 \left[ N \sum_{i=0}^{2} \alpha_i^2 n_i - \left( \sum_{i=0}^{2} n_i \alpha_i \right)^2 \right]} \tag{3} \]

where m_1 and m_2 are the row totals and n_0, n_1, n_2 the column totals of
Table 2, and the weights α_0, α_1, α_2 are chosen according to the suspected
influence mechanism (usually α_i = i is an appropriate choice). As before, T_3 is
approximately distributed according to the χ² distribution with one degree of
freedom (Armitage, 1955; Sasieni, 1997).

The transmission disequilibrium test (TDT) is applicable only if the data
consist of parent-child trios. The test measures whether one homozygous
genotype is more over-represented than the other among affected children
with heterozygous parents. For that we must first select trios where both
parents have a heterozygous genotype. Let u be the count of AA and v be
the count of BB genotypes among children. Then the corresponding
statistic

\[ T_4 = \frac{(u - v)^2}{u + v} \tag{4} \]

is again approximately distributed according to the χ² distribution with one
degree of freedom (Spielman et al., 1993). Compared with previous tests,
TDT is less sensitive to sampling artefacts but it also requires more
structured data.
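
For reference, the four statistics above can be evaluated in the clear as follows. This is a plain, non-secure sketch meant only to make formulae (1)-(4) concrete; the function names are hypothetical and the inputs follow the layout of Tables 1 and 2 (cases a, c and r; controls b, d and s).

```python
def chi2_standard(a, b, c, d):
    """T_1: standard chi-squared statistic for Table 1 (cases a, c; controls b, d)."""
    return (a + b + c + d) * (b * c - a * d) ** 2 / (
        (a + b) * (a + c) * (b + d) * (c + d)
    )


def chi2_equiproportionality(a, b, c, d):
    """T_2: test for equiproportionality of allele A in both groups."""
    return (b + d) * (b * c - a * d) ** 2 / (b * (a + c) ** 2 * d)


def cochran_armitage(r, s, weights=(0, 1, 2)):
    """T_3: trend test; r and s are the AA, AB, BB counts of cases and controls."""
    m1, m2 = sum(r), sum(s)                    # row totals
    n = [ri + si for ri, si in zip(r, s)]      # column totals
    total = m1 + m2
    numerator = total * sum(
        w * (ri * m2 - si * m1) for w, ri, si in zip(weights, r, s)
    ) ** 2
    denominator = m1 * m2 * (
        total * sum(w * w * ni for w, ni in zip(weights, n))
        - sum(w * ni for w, ni in zip(weights, n)) ** 2
    )
    return numerator / denominator


def tdt(u, v):
    """T_4: transmission disequilibrium statistic from the AA and BB child counts."""
    return (u - v) ** 2 / (u + v)
```

The default weights correspond to the choice α_i = i mentioned in the text.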



Table 2 (contents)

Group      Allele AA   Allele AB   Allele BB   Total
Cases      r_0         r_1         r_2         m_1
Controls   s_0         s_1         s_2         m_2
Total      n_0         n_1         n_2         N



3.2 Essentials of secure computing using secret sharing 

A secure computation program is similar to a standard computer pro- 
gram. The difference between the two is in how the data are stored and 
processed. In a standard program, all values are processed publicly, 
whereas in a secure computation program, it is possible to specify 
which values are publicly visible and which are stored using techniques 
like secret sharing. These values can be used in computations without 
leaking their contents. 

In our proposed solution, the data are secret shared between three or 
more hosts. The hosts themselves are not able to understand the values 
stored in their databases because each value looks like random noise 
owing to secret sharing. However, it is not trivial to perform computa- 
tions on secret shared values as special secure multi-party computation 
protocols are required. The secure computation protocols used by the 
hosts preserve the privacy of the data during computation. The genotype 
data remain secure as long as the hosts do not share their databases of 
shares with each other. Figure 1 shows how secure multi-party compu- 
tation works with secret sharing. 

Secure computation can be used to perform most data processing op- 
erations. However, current solutions have some important differences 
compared with standard programming: (i) floating-point operations are 
significantly slower; (ii) comparison operations are slower than multipli- 
cation and addition; (iii) parallel execution of several operations is faster 
than sequential execution. Further details can be found in the 
Supplementary Data and in the articles by Bogdanov et al. (2008),
Damgård et al. (2009) and Geisler (2010).



3.3 Secure storage of genotype data 

Allele-level descriptions of genotypes are commonly stored as sequences 
of pairs where each pair is encoded as AA, AB, BB or NN, corresponding 
to a specific SNP. In GWAS, such data are converted into contingency 
tables as described in Tables 1 and 2 depending on the analysis method 
being used. 

This kind of counting, however, requires the use of string comparison 
operations that we would like to avoid, as they tend to be relatively slow 
in the case of share computing. Therefore, we represent each SNP as a 
pair of integers (A,B), where A counts the occurrences of the first allele 
and B the occurrences of the second allele in that SNP. That is, pairs AA, 
AB, BB, NN are encoded as (2,0), (1,1), (0,2), (0,0), respectively. 

During the data collection phase, data donors, wetlabs or gene banks
must first convert genotypes into the form described above and then 
secret share them between data hosts. As a result, we get a secret 
shared database where each column corresponds to an individual 
donor and there are two rows for each SNP (one for A and one for B). 
For each donor, we also store an ID value that uniquely identifies them. 
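
The integer encoding and database layout described above can be sketched as follows. This is an illustrative, non-secure example: the names `ENCODING`, `encode_donor` and `build_database` are hypothetical, and in a real deployment every entry would be secret shared before it leaves the data collector.

```python
# Genotype pair -> (count of allele A, count of allele B); NN marks a missing call.
ENCODING = {"AA": (2, 0), "AB": (1, 1), "BB": (0, 2), "NN": (0, 0)}


def encode_donor(genotypes):
    """Return the list of A counts and the list of B counts for one donor."""
    pairs = [ENCODING[g] for g in genotypes]
    return [p[0] for p in pairs], [p[1] for p in pairs]


def build_database(donors):
    """Build a matrix with one column per donor and two rows per SNP.

    donors maps a donor ID to a list of genotype strings in a fixed SNP order.
    Row 2k holds the A counts and row 2k + 1 the B counts of SNP k.
    """
    ids = list(donors)
    n_snps = len(next(iter(donors.values())))
    rows = [[0] * len(ids) for _ in range(2 * n_snps)]
    for col, donor_id in enumerate(ids):
        a_counts, b_counts = encode_donor(donors[donor_id])
        for snp in range(n_snps):
            rows[2 * snp][col] = a_counts[snp]
            rows[2 * snp + 1][col] = b_counts[snp]
    return ids, rows


ids, rows = build_database({"d1": ["AA", "AB"], "d2": ["BB", "NN"]})
```

With this layout, counting alleles reduces to additions and multiplications on integers, which avoids string comparisons altogether.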

3.4 Secure formation of case-control groups

To hide the identities of case and control group members, secret shared
index vectors are used to specify the groups. An index vector x is a
zero-one vector, where x_i corresponds to the i-th column in the database
and x_i = 1 if and only if the i-th person is a member of the group that
the vector represents.

The two principal ways to construct case-control groups can be
seen in Figure 2 Panel C. In the first scenario, analysts have direct access
to clinical data for a set of donors and, thus, are able to select case and
control groups and construct index vectors. In the second scenario,
phenotype data are also stored in secret shared form. In this case, there
exists a secret shared database of phenotype attributes that consists of
boolean attributes (e.g. has diabetes) and integers (e.g. age, blood pressure,
height). To construct case and control groups, the analyst has to
specify a logical expression based on which the hosts perform the necessary
comparisons and output the two index vectors in secret shared form.
For simple inclusion criteria, such as the one we used for experiments
[(age > 59 ∧ age < 91) ∧ bp > 160 ∧ hasdiabetes = yes], the corresponding
secure comparison protocol is rather efficient.
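
The inclusion criterion above can be turned into an index vector as sketched below. The example runs in the clear and uses hypothetical field names (`age`, `bp`, `has_diabetes`); it only mirrors the structure of the secure version, in which each comparison yields a secret shared bit and logical conjunction becomes multiplication of bits.

```python
def case_bit(person):
    """Return 1 if the person meets the inclusion criterion, else 0."""
    older = int(person["age"] > 59)
    younger = int(person["age"] < 91)
    high_bp = int(person["bp"] > 160)
    diabetic = int(person["has_diabetes"])
    # A conjunction of 0/1 values is their product.
    return older * younger * high_bp * diabetic


def index_vector(phenotypes, criterion):
    """Build the zero-one index vector, one entry per database column."""
    return [criterion(p) for p in phenotypes]


phenotypes = [
    {"age": 64, "bp": 172, "has_diabetes": True},
    {"age": 45, "bp": 165, "has_diabetes": True},
    {"age": 70, "bp": 150, "has_diabetes": False},
]
x = index_vector(phenotypes, case_bit)
```

The control group would be described by a second criterion and a second index vector built in the same way.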

3.5 Secure assembly of contingency groups 

Let x and y be index vectors for case and control groups, respectively. Let 
A and B be database columns for a particular SNP. Then the allele counts 
needed for tests (1) and (2) can be expressed as 

\[ a = \sum_{i=1}^{n} x_i A_i, \qquad b = \sum_{i=1}^{n} y_i A_i, \qquad c = \sum_{i=1}^{n} x_i B_i, \qquad d = \sum_{i=1}^{n} y_i B_i. \tag{5} \]

Data hosts can securely compute shares of a, b, c, d for each SNP. 

For the second contingency table, A(A − B) = 0 for allele combinations
AB, BB and NN. Similarly, B(B − A) = 0 for AA, AB and NN
and A[4 − (A − B)²] = 0 for AA, BB and NN. Each expression equals 4 for the
one remaining genotype, so we can express counts as

\[
\begin{aligned}
4r_0 &= \sum_{i=1}^{n} x_i A_i (A_i - B_i), & 4s_0 &= \sum_{i=1}^{n} y_i A_i (A_i - B_i), \\
4r_1 &= \sum_{i=1}^{n} x_i A_i \left[ 4 - (A_i - B_i)^2 \right], & 4s_1 &= \sum_{i=1}^{n} y_i A_i \left[ 4 - (A_i - B_i)^2 \right], \\
4r_2 &= \sum_{i=1}^{n} x_i B_i (B_i - A_i), & 4s_2 &= \sum_{i=1}^{n} y_i B_i (B_i - A_i).
\end{aligned} \tag{6}
\]


For TDT, we need to construct an index vector z for detecting heterozygous
parents, i.e. z_i = 0 if at least one parent is homozygous and z_i = 1
otherwise. As A · B = 0 for homozygous allele combinations, we compute

\[ z_i = (A_{\mathrm{mother}} \cdot B_{\mathrm{mother}}) \cdot (A_{\mathrm{father}} \cdot B_{\mathrm{father}}). \tag{7} \]

To get shares of u and v, we can combine z with counts of AA and BB.

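
The counting formulae of this section use only additions and multiplications, which is what makes them suitable for share computing. The following non-secure sketch checks them on a toy SNP; the helper names are hypothetical and the values are plain integers rather than shares.

```python
def dot(u, v):
    """Inner product of two equally long integer vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def allele_counts(x, y, A, B):
    """Counts a, b, c, d of Table 1 from index vectors x (cases) and y (controls)."""
    return dot(x, A), dot(y, A), dot(x, B), dot(y, B)


def genotype_counts(x, A, B):
    """Counts of AA, AB and BB genotypes in the group selected by x."""
    aa = sum(xi * a * (a - b) for xi, a, b in zip(x, A, B)) // 4
    ab = sum(xi * a * (4 - (a - b) ** 2) for xi, a, b in zip(x, A, B)) // 4
    bb = sum(xi * b * (b - a) for xi, a, b in zip(x, A, B)) // 4
    return aa, ab, bb


def both_parents_heterozygous(a_mother, b_mother, a_father, b_father):
    """z_i: 1 if both parents have genotype AB, 0 otherwise."""
    return (a_mother * b_mother) * (a_father * b_father)


# Five donors with genotypes AA, AB, BB, NN, AB at one SNP.
A = [2, 1, 0, 0, 1]
B = [0, 1, 2, 0, 1]
x = [1, 1, 0, 0, 1]  # cases
y = [0, 0, 1, 1, 0]  # controls
table1 = allele_counts(x, y, A, B)
r = genotype_counts(x, A, B)
s = genotype_counts(y, A, B)
```

Note that the missing genotype NN contributes zero to every count, so it needs no special handling.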
3.6 Secure statistical testing 

To determine whether a SNP is significant, one must check whether the
P-value associated with a test statistic is below a prescribed
significance level. Let α be the desired significance level and let T_α be such
that Pr(T > T_α) = α when T is distributed according to the χ² distribution
with one degree of freedom. A SNP is significant only if the corresponding
test statistic exceeds T_α.

Direct evaluation of formulae (1)-(4) requires floating-point arithmetic,
which we would like to avoid. Hence, we must rewrite the inequality
T_j > T_α in terms of integer operations. Let T_α be represented as a fraction
p/q and the test statistic as a fraction m/n. Provided that the denominators
n and q are positive, the condition T_j > T_α is
equivalent to the condition mq > np. Both sides of this inequality can be
securely computed and the inequality can be evaluated by using a secure
comparison operation, after which we can publish the comparison results
to find out which SNPs are significant.

The significance level must be determined considering the multiple
testing issue. The simplest way is to use Bonferroni correction, which is
a conservative measure. It is also possible to perform privacy-preserving
FDR correction. We estimate that the secure version of the standard
FDR procedure takes about 10 min to complete for 262 264 SNPs.
However, there are alternatives to the original algorithm that are significantly
faster; see the Supplementary Data for further details.
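
The integer-only comparison can be illustrated for statistic (1). The sketch below is non-secure and makes two assumptions that are not from the text: the function names are hypothetical, and the threshold 3841/1000 approximates the χ² critical value for an uncorrected significance level of 0.05, chosen only for illustration (a real study would use a level adjusted for multiple testing).

```python
def chi2_fraction(a, b, c, d):
    """Numerator m and denominator n of statistic (1), both integers."""
    m = (a + b + c + d) * (b * c - a * d) ** 2
    n = (a + b) * (a + c) * (b + d) * (c + d)
    return m, n


def exceeds_threshold(m, n, p, q):
    """Decide m/n > p/q with integer arithmetic only; assumes n > 0 and q > 0."""
    return m * q > n * p


P, Q = 3841, 1000  # T_alpha of roughly 3.841 written as the fraction p/q

m, n = chi2_fraction(40, 10, 10, 40)
significant = exceeds_threshold(m, n, P, Q)
```

In the secure setting, m and n stay secret shared and only the single bit produced by the comparison is published for each SNP.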

4 RESULTS 

To demonstrate the feasibility of our approach, we used the
Sharemind multi-party computation platform (Bogdanov
et al., 2008) to implement core algorithms for GWAS. Our
choice was mainly motivated by the efficiency and ease of use
of the Sharemind platform. Alternative platforms [VIFF
(Damgård et al., 2009) and FairplayMP (Ben-David et al.,
2008)] should give similar results.

We used 270 genotypes from the HapMap project
(International HapMap Consortium, 2003) measured with the
Affymetrix Mapping 500K Array as the main data source. In
each experiment, we divided the data randomly into case and
control groups and performed a genome-wide search for highly
differentiated SNPs. For that we used cryptographically secure
counterparts of standard statistical tests used in GWAS: two χ²
tests for independence (Visscher and Hellard, 2003), the
Cochran-Armitage test for trend (Sasieni, 1997) and TDT (Spielman et al.,
1993). As our algorithms return exactly the same outputs as the
original algorithms, we report only performance results for various
subtasks. To show the variability of running times, we
report the mean and standard deviation of four independent
runs.

Each of the donors has 262 264 measured SNPs. First, we ran 
the algorithm on the data of 270 donors, and then we went on to 
test the data of 540, 810 and 1080 donors. We performed the 
experiments on three servers running Sharemind. Each server 
was an off-the-shelf server-grade machine with 48 GB RAM of 
which less was used, twelve 2.93 GHz Intel Xeon (Westmere) 
cores of which two were used and a 1 Gb/s local area network 
(LAN) connection. At the moment, the network connection is 
the bottleneck in terms of algorithm running time; however, 
Sharemind has been successfully used in real applications 
(Bogdanov et al., 2012).

The time spent on data acquisition and secure storage does not 
depend on the statistical test used later on. It depends only on the 
number of SNPs and the number of gene donors. The average 
time it takes to encode and share the SNPs for the described case 
can be seen in Table 3. Note that secret sharing and uploading 
data are done only once for each dataset; hence, this is a single- 
time cost. 

The time needed to form case-control groups depends on the
application scenario. When the analyst has direct access to
phenotype data and can form case and control groups by
herself/himself, then there is no computational overhead. In more
involved cases, the case and control groups must be constructed
based on secret shared phenotype attributes. In this case, the
overhead depends on the complexity of inclusion criteria for
case and control groups. Filtering results presented in Table 3
show that formation of such groups can be done in seconds for
typical inclusion criteria that consist of simple comparison operations
mixed with logical connectives.

Table 3. Performance results for data upload and filtering

Number of donors   Upload     Filtering
270 donors         12.0 min   0.51 s
540 donors         17.3 min   0.59 s
810 donors         23.2 min   0.63 s
1080 donors        29.4 min   0.68 s

The time needed to perform the statistical test depends on the
test, but in all cases it can be broken down into counting allele
frequencies and evaluating the test statistic. As Tables 4 and 5 clearly
show, the main performance bottleneck is frequency counting,
which scales linearly w.r.t. the total number of SNP measurements.
As our encoding is optimized for the χ² test, a better encoding
will enhance the performance of the Cochran-Armitage tests
but not beyond the χ² results. The total duration of the analysis is
the sum of the frequency analysis and evaluation, as the other
parts have a negligible duration.

The presented results clearly show that cryptographically
secure evaluation of statistical tests on a genome-wide scale is practically
feasible. The expected running time is hours instead of a
few minutes when computed non-securely. However, the latter is 
not a significant slowdown compared with the time needed to 
acquire the data if the secure analysis method is not used. 

5 DISCUSSION 

Although the results prove the practical feasibility of crypto- 
graphically secure GWAS, the solution is also notably more re- 
source demanding than the alternatives. We have to analyze 
further whether potential benefits outweigh costs. We consider 
three potential application scenarios and contrast our approach 
with the alternatives. 

5.1 Collaboration between hospitals and biobanks 

By now many countries have national or state-funded biobanks. 
There are more than 40 state-governed biobanks in Europe (Zika 
et al., 2010) and many others in Asia and America (Swede et al.,
2007). Thus, there is a huge potential for collaborative studies 
between biobanks, hospitals and pharmaceutical companies. For 
example, if a clinical study indicates that a treatment is ineffective 
for a certain group of people, a genome-wide association study 
can indicate whether a particular set of SNPs can be used to 
predict efficacy of a treatment. In particular, GWAS have 
shown that certain SNP mutations influence efficacy of treat- 
ments for asthma, inflammatory bowel disease, coronary heart 
disease and cancer (Grant and Hakonarson, 2007). 

Table 4. Performance results for three different frequency analyses

Number of donors           χ² tests        Cochran-Armitage   TDT
270 donors                 34 ± 5 min      94 ± 2 min         28 ± 6 min
540 donors                 69 ± 12 min     204 ± 9 min        58 ± 7 min
810 donors                 102 ± 15 min    284 ± 16 min       91 ± 11 min
1080 donors                144 ± 34 min    432 ± 50 min       120 ± 11 min
1080 donors (non-secure)   14 s            35 s               11 s

Table 5. Performance results for four test evaluation methods

Number of donors           χ² test 1    χ² test 2    Cochran-Armitage   TDT
270 donors                 26 ± 3 s     29 ± 3 s     57 ± 7 s           33 ± 8 s
540 donors                 28 ± 5 s     28 ± 3 s     62 ± 4 s           55 ± 31 s
810 donors                 30 ± 4 s     31 ± 3 s     55 ± 5 s           65 ± 29 s
1080 donors                35 ± 17 s    39 ± 19 s    63 ± 9 s           46 ± 9 s
1080 donors (non-secure)   21 ms        20 ms        49 ms              11 ms

Privacy issues are the major obstacle in such studies: neither
biobanks nor research institutions can give out data without
explicit consent from the patients or an explicit decision by a
relevant ethics committee. In such scenarios, secure GWAS can be
used as a pilot study for assessing potential benefits of combined
studies. In particular, there is no reason to merge the data for
further analysis if a secure GWAS reveals no differentially
expressed SNPs. Because such a study can be conducted in a few
hours, it can significantly speed up pharmacogenetic studies, and
the results can be used by ethics committees to make
better-informed decisions.
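
The pilot-study decision can be illustrated with a plain, non-secure
sketch: compute a Pearson χ² statistic per SNP from a 2×2 allele-count
table and check whether any SNP passes a threshold. The function names,
SNP identifiers, counts and the threshold of 10.83 (the 0.001 critical
value for one degree of freedom) are illustrative assumptions; a real
study would use a much stricter, multiple-testing-corrected threshold
and would evaluate the statistic on protected data.

```python
def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the 2x2 table [[a, b], [c, d]].

    Rows are case/control groups, columns are the two alleles.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A zero margin carries no evidence of association.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def pilot_hits(tables, threshold):
    """Return the SNPs whose statistic reaches the threshold."""
    return [snp for snp, t in tables.items() if chi2_2x2(*t) >= threshold]


# Hypothetical allele counts: (case A, case B, control A, control B).
tables = {
    "rs_a": (30, 70, 32, 68),
    "rs_b": (60, 40, 30, 70),
}
THRESHOLD = 10.83  # illustrative only

hits = pilot_hits(tables, THRESHOLD)
print("merge data for further analysis:", bool(hits), hits)
```

If the list of hits is empty, the pilot study gives no reason to merge
the underlying data.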

5.2 Servicing study data without privacy breaches 

In many state-funded studies, collected data must be made ac- 
cessible for public use by submitting them to a central repository. 
The NIH example shows that direct publication of GWAS data may
have unintended consequences even if the data are presented
to the public in aggregated form (Couzin, 2008). One
potential solution is to set up an online web service for conduct- 
ing GWAS: a researcher just posts inclusion criteria for case and 
control groups and all computations are done by the host. In 
most cases, such a solution is adequate without cryptographic 
countermeasures. Privacy issues emerge only if researchers want
to pool together data from several different repositories to detect
weak associations or if the inclusion criteria must remain private.
In these cases, secure GWAS methodology can be applied by 
combining methods for international consortium studies with 
phenotype-based filtering (see Fig. 2). 
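
The host-side computation in such a web service can be sketched in
the clear as follows. Inclusion criteria are turned into 0/1 masks
over the donors, and allele counts are obtained as sums of products,
a form chosen here because sums and products are the kind of
operations that carry over to computation on protected values. The
record layout, field names and criteria are hypothetical and are not
taken from the paper.

```python
def allele_counts(genotypes, mask):
    """Count (A, B) alleles over the donors selected by a 0/1 mask.

    genotypes[i] is the number of B alleles (0, 1 or 2) of donor i.
    """
    b = sum(g * m for g, m in zip(genotypes, mask))
    total = 2 * sum(mask)  # two alleles per selected donor
    return total - b, b


def contingency_table(donors, snp, is_case, is_control):
    """Build (case A, case B, control A, control B) for one SNP."""
    genotypes = [d["genotypes"][snp] for d in donors]
    case_mask = [1 if is_case(d) else 0 for d in donors]
    control_mask = [1 if is_control(d) else 0 for d in donors]
    return allele_counts(genotypes, case_mask) + allele_counts(genotypes, control_mask)


# Hypothetical donor records held by the host.
donors = [
    {"age": 61, "diabetic": True, "genotypes": {"snp1": 2}},
    {"age": 58, "diabetic": True, "genotypes": {"snp1": 1}},
    {"age": 64, "diabetic": False, "genotypes": {"snp1": 0}},
    {"age": 35, "diabetic": False, "genotypes": {"snp1": 1}},
]

# Inclusion criteria as posted by a researcher (illustrative).
def is_case(d):
    return d["diabetic"] and d["age"] > 50

def is_control(d):
    return (not d["diabetic"]) and d["age"] > 50

table = contingency_table(donors, "snp1", is_case, is_control)
print(table)
```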

5.3 Faster and more secure consortium studies 

One of the main hurdles in GWAS is the sample size. For rare 
diseases, there are not enough genotyped patients to form a big 
enough case group. Also, over- and under-representation of 
sub-populations can cause spurious associations. Larger studies 
involving several international biobanks can diminish the impact
of such problems. 

Two competing alternatives to our solution in this setting are 
federated database systems with additional security mechanisms 
and hierarchical aggregation of the data. 

A federated database system offers middleware that automat- 
ically binds together several sources and allows users to make 
various queries without knowing how the data are organized. As 
such, a federated database system does not solve privacy issues. 
Hence, one needs an honest broker, that is, a dedicated server with
strengthened security measures that assembles the data and processes
all queries (Boyd et al., 2007). This introduces a
single point of failure: nothing can be done if the security of
the honest broker is breached. Also, as the data are sent directly 
to the honest broker, all biobanks must have explicit clearance 
for releasing the data. 

Hierarchical data aggregation is applicable when results of 
individual studies can be combined with some meta-analysis 
technique (e.g. Wolfson et al., 2010). In such cases, the honest
broker must access only aggregated summaries of individual data 
sources to produce the desired result. Consequently, we get 
stronger privacy guarantees, as the broker receives only a limited 
amount of information. However, unexpected privacy breaches 
can still occur because aggregation methods provide no explicit 
security guarantees and it is extremely difficult to assess how 
much information is leaked through summaries. Also, the ap- 
proach cannot be used when members of case and control groups 
must be kept secret. 

In a nutshell, while the alternatives are faster than our solu- 
tion, they are also much more vulnerable to various attacks, and 
thus, privacy concerns can prevent their usage or considerably 
delay the initial setup time. Moreover, hierarchical data aggre- 
gation techniques can be combined with our solution. Namely, 
our solution can be used to replace the honest broker: biobanks
secret-share the aggregated results, and thus, fewer operations
must be done on shares. For the analysis part of GWAS,
the resulting hybrid algorithm will only take the time given in
Table 5 plus a little overhead, making the computation time
~30 s to 1 min, as the filtering of case and control groups and
computation of contingency tables are done locally.
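
A minimal sketch of the hybrid idea, under assumptions of our own
choosing: each biobank computes its contingency table locally,
splits every cell into additive shares modulo 2^32, and sends one
share vector to each of three computing parties, which then add the
vectors they hold. The modulus, the number of parties, the function
names and the counts are illustrative. The sketch opens the pooled
table only to check the arithmetic; in the setting described above
the test statistic would be evaluated on the shares instead.

```python
import secrets

MODULUS = 2 ** 32  # assumed ring size for the shares


def share(value, parties=3):
    """Split a value into additive shares that sum to it modulo MODULUS."""
    parts = [secrets.randbelow(MODULUS) for _ in range(parties - 1)]
    parts.append((value - sum(parts)) % MODULUS)
    return parts


def reconstruct(parts):
    """Recombine additive shares."""
    return sum(parts) % MODULUS


def share_table(table, parties=3):
    """Share every cell; result[p] is the share vector sent to party p."""
    cell_shares = [share(cell, parties) for cell in table]
    return [list(vector) for vector in zip(*cell_shares)]


def add_shared_tables(shared_tables):
    """Each party adds, cell by cell, the share vectors it received."""
    parties = len(shared_tables[0])
    cells = len(shared_tables[0][0])
    return [
        [sum(t[p][i] for t in shared_tables) % MODULUS for i in range(cells)]
        for p in range(parties)
    ]


def open_table(party_vectors):
    """Recombine the per-party vectors into the clear table."""
    return [reconstruct(cell) for cell in zip(*party_vectors)]


# Hypothetical local tables: (case A, case B, control A, control B).
biobank_1 = [30, 70, 32, 68]
biobank_2 = [12, 8, 5, 15]

pooled_shares = add_shared_tables([share_table(biobank_1), share_table(biobank_2)])
pooled = open_table(pooled_shares)
print(pooled)
```

Because addition of additive shares needs no communication between
the computing parties, pooling the local tables is cheap compared
with filtering donors and counting genotypes on shares.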

5.4 Long-term security in a personal genomic project 

The rapid decrease of genotyping costs and moderate success in 
genetic diagnostics have sparked interest in personal genomics.
Companies, such as 23andMe, deCODEme and Navigenics, 
offer personalized genotyping services. Although participants 
have a right to withdraw their data at any moment, this right 
is enforced only by physical and organizational methods. As a 
consequence, a single successful outsider or insider attack can
void all privacy guarantees. Numerous data leakages in
other areas have shown that such a breach is irreversible:
once data have leaked, there is no way to recall them. On a
shorter timescale, such events are highly improbable. However, 
such projects need privacy guarantees that last more than 100 
years to protect participants and their offspring. In such settings, 
distributed storage based on secret sharing is one of the best 
cryptographic alternatives, as a successful breach of security of
a single facility yields no information and it is possible to recover
from such events. 
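
The following sketch illustrates both properties with additive
secret sharing modulo 2^32 over three storage facilities. A single
share is a uniformly random value and reveals nothing about the
stored genotype. One common way to recover after a suspected breach,
assumed here and not described in the paper, is to re-randomize the
shares by adding a fresh sharing of zero. All names and parameters
are illustrative.

```python
import secrets

MODULUS = 2 ** 32  # assumed ring size for the shares


def share(value, facilities=3):
    """Split a value into additive shares, one per storage facility."""
    parts = [secrets.randbelow(MODULUS) for _ in range(facilities - 1)]
    parts.append((value - sum(parts)) % MODULUS)
    return parts


def reconstruct(parts):
    """Recombine all shares into the stored value."""
    return sum(parts) % MODULUS


def refresh(parts):
    """Re-randomize shares by adding a fresh sharing of zero.

    The stored value is unchanged, but a share copied before the
    refresh no longer matches the shares held afterwards.
    """
    zero = share(0, len(parts))
    return [(p + z) % MODULUS for p, z in zip(parts, zero)]


genotype = 2  # hypothetical value: number of minor alleles at one SNP
stored = share(genotype)
renewed = refresh(stored)

print(reconstruct(stored), reconstruct(renewed))
```

In a deployed system the sharing of zero would be generated jointly
by the facilities, so that no single party ever sees all shares;
the sketch generates it in one place only for brevity.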